http://economicsofmalaria.com joebrew@gmail.com
This document contains some basic preliminary analysis of data relating to malaria, IRS, and absenteeism at Maragra Açucar, in Maragra, Mozambique. To explore a specific section, click any of the below tabs.
The “Maragra database” consists of the following 9 tables:
| Table name | Number of rows | Number of columns |
|---|---|---|
| ab | 80988 | 12 |
| ab_panel | 1759100 | 7 |
| bairros | 255 | 6 |
| census | 328283 | 38 |
| clinic | 3403 | 9 |
| clinic_agg | 138 | 8 |
| irs | 3766131 | 4 |
| mc | 11578 | 15 |
| workers | 14195 | 29 |
These tables fall into three kinds of data: those pertaining to medical outcomes (generally referred to as “clinic” data), those pertaining to worker information (demographic and absenteeism-related), and those pertaining to malaria control activities.
These datasets are all readily available in the (private) maragra R package (access credentials are given to collaborators upon request to joebrew@gmail.com). All datasets except for ab_panel are as is (i.e., in their original raw format), except for 4 kinds of changes:
year_month, day_number, etc.).
The ab_panel dataset is not raw data; rather, it is an amalgamation of the ab and workers datasets, using absence data from the former with worker eligibility dates from the latter, so as to create a “panel” style dataset (i.e., one row for each day on which a worker was supposed to work).
The clinic and clinic_agg datasets are similar, but not identical, in form. The former has individual-level data (useful for pairing with absences and demographic census data), whereas the latter is simply the raw counts of cases by nationality over time. Though clinic is much richer and more detailed than clinic_agg, it only covers the period from 2014 through 2016, whereas the latter covers a larger time period (2010-2016).
For a “deep dive” into each kind of data, click the tabs at the top of this page.
In order to reproduce this analysis (and others related to these datasets), follow the instructions on this project’s research compendium page.
Data about “fumigações” (indoor residual spraying or “IRS”) dates from 2011-10-30 through 2017-07-23. The dataset consists of the following fields:
date
insecticida
casas_cobertas
pulverizados
meta_instance_name
unidade
casas_total
month
year
dow
day
day_number
year_month
longitude_aura
latitude_aura
Each row is one fumigation activity. The location/residence key is the unidade column. The casas_cobertas column indicates the number of houses in that unidade that were sprayed, whereas the pulverizados column indicates the number of rooms.
The insecticide used is either DDT or ACT. The below is a breakdown of their respective use.
The respective use over time is as follows:
Below is a table of the same data.
We can examine the date of fumigations to see if they are seasonal vs. randomly/uniformly distributed throughout the year.
The analysis of absenteeism data will rely on a panel-style dataset in which one row exists for each worker-day (that is, each day on which the worker is estimated to have been scheduled to work), with columns indicating the outcome (absent or not, sick or not, etc.). The Maragra CRM does not natively store panel-style data, so we construct it from a combination of the workers dataset (from the Human Resources department) and the absenteeism dataset. Certain features pertaining to illness are merged from the clinic dataset.
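The panel construction described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code of the maragra package: the field names `start_date`, `end_date`, and `sick` are hypothetical, and the sketch assumes every calendar day inside a worker's eligibility window counts as a scheduled day.

```python
from datetime import date, timedelta

def build_panel(workers, absences):
    """Expand each worker's eligibility window into one row per worker-day.

    workers:  list of dicts with oracle_number, start_date, end_date
              (start_date / end_date are hypothetical field names).
    absences: list of dicts with oracle_number, date, sick (bool).
    """
    # Index absences by (worker, day) so each worker-day is a single lookup.
    absence_index = {(a["oracle_number"], a["date"]): a for a in absences}
    panel = []
    for w in workers:
        day = w["start_date"]
        while day <= w["end_date"]:
            record = absence_index.get((w["oracle_number"], day))
            panel.append({
                "oracle_number": w["oracle_number"],
                "date": day,
                "absent": record is not None,
                "absent_sick": bool(record and record["sick"]),
            })
            day += timedelta(days=1)
    return panel

workers = [{"oracle_number": 1,
            "start_date": date(2013, 1, 1),
            "end_date": date(2013, 1, 5)}]
absences = [{"oracle_number": 1, "date": date(2013, 1, 3), "sick": True}]
panel = build_panel(workers, absences)
```

In this toy input the single worker contributes five worker-days, one of which is a sick absence.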
The ab_panel dataset has the following column names:
oracle_number
date
leave_type
leave_taken
absent
absent_sick
unidade
Overall, we have observed 1759100 eligible worker days (the equivalent of 4816 years of human activity!), from the period of 2013-01-01 through 2016-12-31.
Our dataset includes 97770 absences and 1661330 presences, which can be broken down below in percentage terms.
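As a quick arithmetic check on the figures above, the following sketch recomputes the person-years and the percentage split. The divisor of 365.25 days per year is an assumption; it happens to reproduce the 4816 figure quoted in the text.

```python
# Totals quoted in the text.
worker_days = 1_759_100
absences = 97_770
presences = 1_661_330

# Absences and presences should account for every eligible worker-day.
assert absences + presences == worker_days

# Assumption: one year is taken as 365.25 days.
person_years = worker_days / 365.25
pct_absent = 100 * absences / worker_days
pct_present = 100 * presences / worker_days

print(f"{person_years:.0f} person-years")
print(f"{pct_absent:.2f}% absent, {pct_present:.2f}% present")
```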
Among absences, the breakdown of absence “type” is as follows.
We calculate an absenteeism rate for any given time period (day, month, etc.) as the number of days not worked divided by the number of days which should have been worked, and multiplied by 100. The below shows the absenteeism rate, by day, for the entire observation period. The size of each dot is a reflection of the number of workers observed at that time.
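The rate definition above (days not worked divided by days that should have been worked, times 100) can be sketched as a small grouping function. The `date` and `absent` names follow the ab_panel columns; storing dates as ISO strings is an assumption made to keep the example short.

```python
from collections import defaultdict

def absenteeism_rate(rows, key):
    """Return {period: (rate_percent, n_worker_days)} for panel rows.

    rows: iterable of dicts with 'date' (ISO string) and boolean 'absent'.
    key:  function mapping a row to its period label (day, month, ...).
    """
    absent = defaultdict(int)
    scheduled = defaultdict(int)
    for row in rows:
        period = key(row)
        scheduled[period] += 1
        absent[period] += int(row["absent"])
    return {p: (100 * absent[p] / scheduled[p], scheduled[p]) for p in scheduled}

rows = [
    {"date": "2013-01-01", "absent": False},
    {"date": "2013-01-01", "absent": True},
    {"date": "2013-01-02", "absent": False},
    {"date": "2013-02-01", "absent": True},
]
daily = absenteeism_rate(rows, key=lambda r: r["date"])
monthly = absenteeism_rate(rows, key=lambda r: r["date"][:7])
```

The second element of each tuple is the number of worker-days observed, which is the quantity the dot sizes in the chart reflect.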
We again calculate the absenteeism rate, but only for those absences which are classified as “sick leave”.
Daily absenteeism is problematic in that it introduces a great deal of noise. So, we examine monthly absenteeism rates for both all and sick absences. The below chart shows these metrics, as well as the percentage of monthly absences due to sickness, and the number of worker-days observed. A local regression smoothed line is overlaid to see overall trends.
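The monthly metrics named above (all-absence rate, sick-absence rate, share of absences due to sickness, and worker-days observed) could be derived along these lines. This is a sketch under the assumption that rows carry the ab_panel columns `date`, `absent`, and `absent_sick`, with dates as ISO strings; it does not attempt the local regression smoothing.

```python
from collections import defaultdict

def monthly_metrics(rows):
    """Summarise panel rows by calendar month (YYYY-MM)."""
    totals = defaultdict(lambda: {"days": 0, "absent": 0, "sick": 0})
    for row in rows:
        month = row["date"][:7]
        totals[month]["days"] += 1
        totals[month]["absent"] += int(row["absent"])
        totals[month]["sick"] += int(row["absent_sick"])
    out = {}
    for month, t in sorted(totals.items()):
        out[month] = {
            "worker_days": t["days"],
            "absence_rate": 100 * t["absent"] / t["days"],
            "sick_absence_rate": 100 * t["sick"] / t["days"],
            # Undefined (None) when a month has no absences at all.
            "pct_absences_sick": 100 * t["sick"] / t["absent"] if t["absent"] else None,
        }
    return out

rows = [
    {"date": "2013-01-01", "absent": True, "absent_sick": True},
    {"date": "2013-01-02", "absent": True, "absent_sick": False},
    {"date": "2013-01-03", "absent": False, "absent_sick": False},
    {"date": "2013-01-04", "absent": False, "absent_sick": False},
    {"date": "2013-02-01", "absent": False, "absent_sick": False},
]
metrics = monthly_metrics(rows)
```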
The below chart is a paneling of (a) type of absenteeism metric (columns) and (b) type of worker (rows).
There are two clinical datasets: clinic and clinic_agg. The former is detailed at the individual level, but the latter covers a slightly wider timespan.
The clinic dataset has the following fields:
date
name
severity
month
year
dow
day
day_number
year_month
The clinic_agg dataset has the following fields:
month
year
group
tested
positive
negative
date
percent_positive
clinic_agg
If we combine all workers, we can examine the total incidence of malaria since 2011.
The below charts show the total number of tests and positive cases, by month, for Mozambican and foreign workers, respectively, at Ilovo-Maragra.
clinic_agg
The below chart shows both the number of positive cases (all workers, in red) and the number of tests, by month.
The below chart shows the same data as above, but converts the number of positive cases to a percentage of all tests, rather than an absolute number.
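The conversion from absolute positives to a percentage of tests is simple arithmetic; a minimal sketch follows. The `tested` and `positive` names match the clinic_agg fields, the sample records are invented for illustration, and returning `None` for months without tests is a choice of this sketch.

```python
def percent_positive(tested, positive):
    """Share of tests that were positive, as a percentage (None if no tests)."""
    if tested == 0:
        return None
    return 100 * positive / tested

# Illustrative records only; these are not values from the Maragra data.
records = [
    {"year": 2015, "month": 1, "tested": 200, "positive": 50},
    {"year": 2015, "month": 2, "tested": 0, "positive": 0},
]
for r in records:
    r["percent_positive"] = percent_positive(r["tested"], r["positive"])
```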
clinic_agg
We can examine the annual seasonality of positive cases by overlaying all years’ data onto one axis.
The below chart uses the same data as above, but instead of positive cases, it shows the percent of tests which were positive.
If we aggregate and view distributions (via “violin” charts) at the level of the month, seasonality is more apparent.
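Aggregating at the level of the calendar month amounts to pooling each month's values across years. The sketch below shows that grouping step, assuming `month` is stored as an integer from 1 to 12 (an assumption) and using invented counts purely for illustration.

```python
from collections import defaultdict
from statistics import median

def by_calendar_month(records, field="positive"):
    """Pool one field across years, keyed by calendar month."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r["month"]].append(r[field])
    return {m: sorted(values) for m, values in sorted(grouped.items())}

# Illustrative records only; these are not values from the Maragra data.
records = [
    {"year": 2014, "month": 1, "positive": 30},
    {"year": 2015, "month": 1, "positive": 50},
    {"year": 2014, "month": 7, "positive": 5},
    {"year": 2015, "month": 7, "positive": 9},
]
seasonal = by_calendar_month(records)
medians = {m: median(v) for m, v in seasonal.items()}
```

Each list in `seasonal` is the sample that one violin would summarise.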
clinic
The below chart shows the severity of all clinical malaria cases in the Maragra clinic.
The below chart shows severity, but over time.
For our purposes we are analyzing the effect of one intervention (IRS) on 2 outcomes (absence and illness) with many confounders (age, worker type, seasonality, etc.). Our analysis can be visualized formulaically as follows:
\[ \operatorname{Pr}(\text{Outcome} = 1 \mid X) = \beta_{0} + \beta_{1} \text{Location} + \beta_{2} \text{Season} + (\beta_{3} \text{IRS} \times \beta_{4} \text{IRS}_{t} + \dots) \]
Our outcome is probabilistic and binomial (i.e., one is either absent or present / infected or not infected). Our demographic confounders (represented by \(\dots\)) will be a function of iterative model selection. Our intervention (IRS) is not a simple yes/no, but rather the product of whether the residence of the worker in question was treated in the last year, and, if so, the time since treatment (represented above as the interaction term, where \(_t\) represents time elapsed since commencement of the most recent IRS campaign).
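To make the two components of the IRS term concrete, the sketch below derives them for a single worker-day. The function name, the use of a single `last_spray_date` per residence, and the 365-day window boundary being inclusive are all assumptions of this illustration.

```python
from datetime import date

def irs_exposure(observation_date, last_spray_date, window_days=365):
    """Return (treated, days_since) for one worker-day.

    treated is 1 when the residence was sprayed within `window_days`
    before the observation date, else 0. days_since is the elapsed time,
    set to 0 when untreated so that treated * days_since is 0 as well.
    """
    if last_spray_date is None or last_spray_date > observation_date:
        return 0, 0
    elapsed = (observation_date - last_spray_date).days
    if elapsed > window_days:
        return 0, 0
    return 1, elapsed

recent = irs_exposure(date(2015, 3, 1), date(2015, 1, 1))
stale = irs_exposure(date(2016, 6, 1), date(2015, 1, 1))
never = irs_exposure(date(2015, 3, 1), None)
```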
Since the full accounting of confounders would greatly reduce the degrees of freedom of our analysis, we employ propensity score matching to generate a matched sample of workers who are alike in characteristics and time, but not treatment. We do this by first estimating the likelihood of having ever received the intervention, given a worker’s age, sex, department and temporary vs. permanent status. We justify the necessity of this matching by noting that the differences between those workers who received IRS and those who did not (see table 1) are striking and in most cases statistically significant.
| | IRS | No IRS | p |
|---|---|---|---|
| n | 3395 | 10796 | |
| STATUS = Temporary (%) | 3134 (92.3) | 10142 (93.9) | 0.001 |
| DEPARTMENT (%) | | | 0.001 |
| Administrative | 112 (3.3) | 294 (2.7) | |
| Factory | 336 (9.9) | 886 (8.2) | |
| Field | 2947 (86.8) | 9616 (89.1) | |
| AGE (mean (sd)) | 35.34 (10.10) | 36.12 (10.23) | <0.001 |
| SEX = M (%) | 1947 (57.3) | 6478 (60.0) | 0.006 |
| RECEIVED = No IRS (%) | 0 (0.0) | 10796 (100.0) | <0.001 |
Having now demonstrated that our treatment and control groups are qualitatively different (and therefore require either statistical adjustment or a priori matching), we proceed to carry out the matching, using the best practices suggested by Ho et al. (2007) “for improving parametric statistical models by preprocessing data with nonparametric matching methods” (Daniel Ho, Kosuke Imai, Gary King, and Elizabeth Stuart (2007). Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference. Political Analysis 15(3): 199-236. http://gking.harvard.edu/files/abs/matchp-abs.shtml). We employ the nearest neighbor method to identify those workers in our control group who most resemble the workers in the treatment group.
Our match is a 1-to-1 cut, meaning those control workers who do not resemble those in the treatment group are left out of primary analysis. The below table shows the match results.
| | Control | Treated |
|---|---|---|
| All | 10796 | 3395 |
| Matched | 3395 | 3395 |
| Unmatched | 7401 | 0 |
| Discarded | 0 | 0 |
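The idea behind a 1-to-1 nearest neighbor match can be sketched as a greedy pairing on the propensity score. This is an illustration of the general technique rather than the routine used for the table above; the processing order (highest treated score first), the matching without replacement, and the toy scores are assumptions of the sketch.

```python
def nearest_neighbor_match(treated, control):
    """Greedy 1-to-1 nearest-neighbour matching on a propensity score.

    treated, control: dicts mapping worker id -> propensity score.
    Returns {treated_id: control_id}; each control is used at most once.
    """
    available = dict(control)
    matches = {}
    # Process treated units from highest to lowest score (a sketch choice).
    for t_id, t_score in sorted(treated.items(), key=lambda kv: -kv[1]):
        if not available:
            break
        # Pick the still-unused control whose score is closest.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        matches[t_id] = c_id
        del available[c_id]
    return matches

# Illustrative scores only.
treated = {"t1": 0.30, "t2": 0.22}
control = {"c1": 0.21, "c2": 0.29, "c3": 0.10, "c4": 0.25}
matches = nearest_neighbor_match(treated, control)
unmatched = set(control) - set(matches.values())
```

Controls left in `unmatched` correspond to the "Unmatched" row of the table: they are excluded from the primary analysis.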
The following output shows that the distributions of our numeric variables are now extremely similar.
| | Means Treated | Means Control | SD Control | Mean Diff |
|---|---|---|---|---|
| distance | 0.24 | 0.24 | 0.03 | 0.00 |
| age | 35.34 | 35.23 | 10.07 | 0.11 |
| sexF | 0.43 | 0.40 | 0.49 | 0.02 |
| sexM | 0.57 | 0.60 | 0.49 | -0.02 |
| permanent_or_temporaryTemporary | 0.92 | 0.92 | 0.28 | 0.01 |
| departmentFactory | 0.10 | 0.10 | 0.30 | 0.00 |
| departmentField | 0.87 | 0.86 | 0.35 | 0.01 |
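Each row of a balance table like the one above reduces to a few summary statistics per covariate. A minimal sketch, assuming the covariate values for matched treated and control workers are available as plain lists (the ages below are invented):

```python
from statistics import mean, stdev

def balance_row(treated_values, control_values):
    """Summaries for one covariate in a matched sample."""
    m_t = mean(treated_values)
    m_c = mean(control_values)
    return {
        "mean_treated": m_t,
        "mean_control": m_c,
        "sd_control": stdev(control_values),  # sample standard deviation
        "mean_diff": m_t - m_c,
    }

# Illustrative ages only.
age_balance = balance_row([30, 40], [31, 39])
```

A mean difference close to zero relative to the control standard deviation suggests the covariate is well balanced after matching.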
The propensity scores can be visualized below.
Having now created a matched sample of 6790 workers, of which 50% received IRS and 50% did not, we can confidently carry out our analysis on this sample. Since the propensity score matching effectively cancels out demographic differences, our model need only take into account those differences which are not at the person level. In our case, these include seasonality (defined here by quarter); other factors will be added later.
For the purposes of this first pass, we “bin” IRS exposure into 5 groups: before IRS (includes IRS > 365 days ago), 180+ days ago, 90-180 days, 60-90 days, and in the last 60 days.
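The binning can be sketched as a small lookup function. The bin labels follow the `days_since` levels that appear in the odds ratio tables (000-060, 060-090, 090-180, 180+); the label `before` for the reference group and the handling of the exact boundary days are assumptions of this sketch.

```python
def bin_days_since_irs(days):
    """Map days since the most recent spraying to an exposure bin.

    days is None when the residence has not been sprayed yet.
    """
    # Never sprayed, sprayed in the future, or sprayed over a year ago.
    if days is None or days < 0 or days > 365:
        return "before"
    if days < 60:
        return "000-060"
    if days < 90:
        return "060-090"
    if days < 180:
        return "090-180"
    return "180+"

bins = [bin_days_since_irs(d) for d in (None, 10, 75, 120, 300, 400)]
```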
Having estimated our binomial logistic regression model, we examine the odds ratios for absence as a function of our predictive variables.
| Variable | OR | Lower | Upper |
|---|---|---|---|
| (Intercept) | 0.1453572 | 0.1398927 | 0.1509910 |
| days_since000-060 | 0.6762597 | 0.6381616 | 0.7162830 |
| days_since060-090 | 0.6669767 | 0.6144421 | 0.7229558 |
| days_since090-180 | 0.6264072 | 0.5929128 | 0.6615467 |
| days_since180+ | 0.4103491 | 0.3866648 | 0.4352378 |
| quarter2 | 0.4316940 | 0.4071196 | 0.4575924 |
| quarter3 | 0.3868169 | 0.3654818 | 0.4092827 |
| quarter4 | 0.6866548 | 0.6538714 | 0.7210702 |
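For readers less used to odds ratios, the sketch below shows how they relate to logit coefficients and to probabilities. It assumes the factors are treatment-coded, so that the intercept gives the odds at the reference levels (before IRS, first quarter), and it uses a Wald-style interval; whether the intervals in the table were computed this way is not stated in the source.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and Wald-style interval from a logit coefficient and its SE."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def odds_to_probability(odds):
    """Convert odds to a probability."""
    return odds / (1 + odds)

# Values copied from the absence table above.
baseline_odds = 0.1453572   # (Intercept)
or_recent_irs = 0.6762597   # days_since000-060

# Implied daily absence probability at the reference levels, and the same
# probability when IRS took place within the last 60 days.
p_baseline = odds_to_probability(baseline_odds)
p_recent = odds_to_probability(baseline_odds * or_recent_irs)
```

Under these assumptions, the implied probability of absence falls from roughly 12.7% at baseline to roughly 9.0% in the 60 days after spraying.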
We run the same model, but instead of estimating absences, we estimate only the likelihood of sick absences. The results (in form of odds ratios) are below.
| Variable | OR | Lower | Upper |
|---|---|---|---|
| (Intercept) | 0.0122875 | 0.0109890 | 0.0136983 |
| days_since000-060 | 0.8171294 | 0.6943435 | 0.9578084 |
| days_since060-090 | 1.1444964 | 0.9421297 | 1.3801122 |
| days_since090-180 | 1.3149744 | 1.1537264 | 1.4972210 |
| days_since180+ | 0.5220637 | 0.4425079 | 0.6132913 |
| quarter2 | 0.9261983 | 0.8093301 | 1.0599349 |
| quarter3 | 0.5732323 | 0.4957100 | 0.6623028 |
| quarter4 | 0.4912209 | 0.4236313 | 0.5690253 |